All provided data has been taken as ground truth. (For example, some
“home” games were played at secondary locations, including TOR’s
entire 2020-21 season; these are not reflected in the data, and I was
instructed not to account for this.) Note that the OKC and DEN 2024-25
schedules in schedule_24_partial.csv intentionally include
only 80 games, as the league holds out 2 games for each team in
mid-December due to then-unknown NBA Cup matchups. I was instructed not to
assign specific games to fill those two slots.
Note:
Throughout this document, any season column
represents the year each season started. For example, the 2015-16 season
will be in the dataset as 2015. We may refer to a season by just this
number (e.g. 2015) instead of the full text (e.g. 2015-16).
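As a quick illustration of this convention, a game date can be mapped to its season's start year. The helper below is hypothetical (it is not used elsewhere in this project), and the July cutoff is an assumption about where the offseason falls:

```r
# Map a game date to the start year of its season (hypothetical helper).
# NBA seasons begin in the fall, so games before July are assumed to
# belong to the season that started the previous calendar year.
season_of <- function(gamedate) {
  yr <- as.integer(format(gamedate, "%Y"))
  ifelse(as.integer(format(gamedate, "%m")) >= 7, yr, yr - 1)
}
season_of(as.Date("2016-03-01"))  # a March 2016 game belongs to season 2015
```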
Question 1: 26 4-in-6 stretches in OKC’s draft schedule.
Question 2: 24.6 4-in-6 stretches on average.
Question 3: CHA had the highest average (27.7125 per 82 games); NOP had the lowest (21.8325).
Question 4: This is a written question. Response is in the document under Question 4.
Question 5: 54.3% defensive eFG%; 53.5% with the opponent on the second night of a back-to-back.
Work is shown in the document.
Question 8: This is a written question. Response is in the document under Question 8.
Question 9: Most helped: MIL (+5.95 wins). Most hurt: DET (-6.37 wins).
library(tidyverse)
schedule <- read_csv("C:/Users/elias/Downloads/Technical Project OKC/schedule.csv")
draft_schedule <- read_csv("C:/Users/elias/Downloads/Technical Project OKC/schedule_24_partial.csv")
locations <- read_csv("C:/Users/elias/Downloads/Technical Project OKC/locations.csv")
game_data <- read_csv("C:/Users/elias/Downloads/Technical Project OKC/team_game_data.csv")
In this section, you’re going to work to answer questions using NBA scheduling data.
QUESTION: How many times are the Thunder scheduled to play 4 games in 6 nights in the provided 80-game draft of the 2024-25 season schedule? (Note: clarification, the stretches can overlap; the question is really “for how many games is that game the 4th played over the past 6 nights?”)
# First filter out the Nuggets games to focus on the Thunder's schedule.
library(dplyr)
okc_schedule <- draft_schedule |>
filter(team == "OKC")
# Compare the dates of the first and last games of every 4-game stretch.
# Sum the instances where the first game of the stretch took place 5 days
# before the last one; these are the 4-in-6 stretches.
# (The schedule files are sorted most-recent-first, so lead(n = 3) reaches
# the game played 3 games earlier.)
okc_schedule |>
mutate(num_days = gamedate - lead(gamedate, n = 3)) |>
summarize("Number of 4-in-6 Stretches" = sum(num_days == 5, na.rm = TRUE))
## # A tibble: 1 × 1
## `Number of 4-in-6 Stretches`
## <int>
## 1 26
ANSWER 1:
26 4-in-6 stretches in OKC’s draft schedule.
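As a sanity check of the counting logic, here is a minimal sketch on synthetic dates. The `toy` tibble is invented data (not from the provided files), sorted most-recent-first to match the schedule files' ordering:

```r
library(dplyr)
# Five toy games, most recent first (matching the schedule files' sort order)
toy <- tibble(gamedate = as.Date(c("2024-11-07", "2024-11-05",
                                   "2024-11-04", "2024-11-02",
                                   "2024-11-01")))
# Nov 2 -> Nov 7 spans 6 nights (a 5-day difference), so exactly one
# 4-in-6 stretch should be counted
toy |>
  mutate(num_days = as.numeric(gamedate - lead(gamedate, n = 3))) |>
  summarize(stretches = sum(num_days == 5, na.rm = TRUE))
```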
QUESTION: From 2014-15 to 2023-24, what is the average number of 4-in-6 stretches for a team in a season? Adjust each team/season to per-82 games before taking your final average.
# Note: the prompt asks specifically about 4-in-6 stretches, so the tighter
# 4-in-5 stretches that existed before 2018 are not counted
# Count the total 4-in-6 stretches of every team then divide to find the per 82 games
# average for every team. Then take the mean of each team's average.
schedule |>
group_by(team) |>
mutate(num_days = gamedate - lead(gamedate, n = 3)) |>
summarize(avg_4in6 = sum(num_days == 5, na.rm = TRUE) / (length(gamedate) / 82)) |>
summarize("Average Number of 4-in-6 Stretches for a Team in Season" = mean(avg_4in6))
## # A tibble: 1 × 1
## `Average Number of 4-in-6 Stretches for a Team in Season`
## <dbl>
## 1 24.6
ANSWER 2:
24.6 4-in-6 stretches on average.
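The per-82 adjustment above divides a team's raw stretch count by the fraction of an 82-game season it actually played. A quick sketch with made-up numbers (the 72-game figure matches the shortened 2020-21 season length; the stretch count is hypothetical):

```r
# Hypothetical: a team played 72 games (the shortened 2020-21 length)
# and had 21 four-in-six stretches in them.
stretches <- 21
games_played <- 72
per_82 <- stretches / (games_played / 82)
per_82  # ~23.92 stretches per 82 games
```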
QUESTION: Which of the 30 NBA teams has had the highest average number of 4-in-6 stretches between 2014-15 and 2023-24? Which team has had the lowest average? Adjust each team/season to per-82 games.
# Reuse code from Q2 for each team's average.
team_averages <- schedule |>
group_by(team) |>
mutate(num_days = gamedate - lead(gamedate, n = 3)) |>
summarize(avg_4in6 = sum(num_days == 5, na.rm = TRUE) / (length(gamedate) / 82))
# Find the highest and lowest averages and the corresponding teams.
# Paste together teams and values into one string for cleaner printing
# Ensure the two averages are rounded to the same digit
team_averages |>
summarize("Highest Average Team" =
paste(team[which.max(avg_4in6)],
" (", round(max(avg_4in6), digits = 4), ")", sep = ""),
"Lowest Average Team" =
paste(team[which.min(avg_4in6)],
" (", round(min(avg_4in6), digits = 4), ")", sep = ""))
## # A tibble: 1 × 2
## `Highest Average Team` `Lowest Average Team`
## <chr> <chr>
## 1 CHA (27.7125) NOP (21.8325)
ANSWER 3:
CHA has had the highest average number of 4-in-6 stretches (27.7125 per 82 games); NOP has had the lowest (21.8325).
QUESTION: Is the difference between most and least from Q3 surprising, or do you expect that size difference is likely to be the result of chance?
# Check normality of distribution
# Use histogram and summary statistics
averages <- team_averages$avg_4in6
hist(averages, breaks = 8)
summary(averages)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.83 23.49 24.81 24.58 25.72 27.71
# Calculate z-scores
z_score_max <- max(scale(averages))
z_score_max
## [1] 2.04838
z_score_min <- min(scale(averages))
z_score_min
## [1] -1.792808
# Use IQR Method to test if max and min appear to be outliers
q1 <- quantile(averages, 0.25)
q3 <- quantile(averages, 0.75)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
as.logical(min(averages) > lower_bound & max(averages) < upper_bound)
## [1] TRUE
# Use Grubbs' Test to test likelihood that max and min are outliers
library(outliers)
#Grubbs' Test for maximum
grubbs.test(averages, type = 10)
##
## Grubbs test for one outlier
##
## data: averages
## G = 2.04838, U = 0.85033, p-value = 0.5203
## alternative hypothesis: highest value 27.7124842370744 is an outlier
#Grubbs' Test for minimum
grubbs.test(averages, type = 10, opposite = TRUE)
##
## Grubbs test for one outlier
##
## data: averages
## G = 1.79281, U = 0.88535, p-value = 1
## alternative hypothesis: lowest value 21.8325 is an outlier
ANSWER 4:
The difference between the most and least from Q3 is not surprising; a difference of that size is likely the result of chance. Having confirmed the approximate normality of the distribution, z-scores showed that the maximum and minimum were both around 2 standard deviations from the mean, well within the range expected under a normal distribution. The IQR method likewise placed both the maximum and minimum inside the expected range, so neither would be considered an outlier. Lastly, Grubbs’ Test returned high p-values for both the maximum and the minimum, meaning neither is likely to be an outlier. Each method points to the same conclusion: the size of the difference between the most and least from Q3 is likely the result of chance.
QUESTION: What was BKN’s defensive eFG% in the 2023-24 season? What was their defensive eFG% that season in situations where their opponent was on the second night of back-to-back?
# Create a function to calculate eFG%
# Inputs: field goals made, 3 point field goals made, field goals attempted
# Outputs: eFG%
e_fg <- function(fgm, fgm3, fga) {
efg <- (fgm + (0.5 * fgm3)) / fga
return(efg)
}
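For example, with a hypothetical stat line of 40 field goals made (10 of them threes) on 85 attempts; the function is repeated here so the snippet is self-contained:

```r
# Same eFG% formula as e_fg() above, repeated for a self-contained example
e_fg <- function(fgm, fgm3, fga) {
  (fgm + 0.5 * fgm3) / fga
}
e_fg(40, 10, 85)  # (40 + 5) / 85 = 0.5294..., i.e. a 52.9 eFG%
```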
# For BKN defensive eFG% filter to defensive entries for the correct team and season.
# Then sum the appropriate columns and enter the totals eFG% function.
# Multiply by 100 to convert from decimal to percentage.
game_data |>
filter(def_team == "BKN" & season == 2023) |>
summarize("BKN Defensive eFG%" = e_fg(sum(fgmade), sum(fg3made), sum(fgattempted)) * 100)
## # A tibble: 1 × 1
## `BKN Defensive eFG%`
## <dbl>
## 1 54.3
# Separate out each team's schedule and arrange by game date.
# Filter to games one day after the previous game where the opponent is BKN.
# Use eFG% function and convert from decimal to percentage.
game_data |>
group_by(off_team) |>
arrange(gamedate) |>
filter(season == 2023 & gamedate - lag(gamedate) == 1 & def_team == "BKN") |>
ungroup() |>
summarize("BKN Defensive eFG% (Opponent on Second Night of Back-to-Back)" =
e_fg(sum(fgmade), sum(fg3made), sum(fgattempted)) * 100)
## # A tibble: 1 × 1
## `BKN Defensive eFG% (Opponent on Second Night of Back-to-Back)`
## <dbl>
## 1 53.5
ANSWER 5:
BKN’s defensive eFG% in the 2023-24 season was 54.3%. In situations where their opponent was on the second night of a back-to-back, their defensive eFG% was 53.5%.
This is an intentionally open-ended section, and there are multiple approaches you could take to have a successful project. Feel free to be creative. However, for this section, please consider only the density of games and travel schedule, not the relative on-court strength of different teams.
QUESTION: Please identify at least 2 trends in scheduling over time. In other words, how are the more recent schedules different from the schedules of the past? Please include a visual (plot or styled table) highlighting or explaining each trend and include a brief written description of your findings.
# Examining the average number of days of rest between each game.
library(ggplot2)
# Separate each team's schedule each season into groups
# Order by game date and then calculate the number of days of rest before each game
# Account for break in season due to COVID-19
# Put all team averages together for league average each year
schedule2 <- schedule |>
group_by(season, team) |>
arrange(gamedate) |>
mutate(days_rest = as.numeric(gamedate - lag(gamedate)) - 1,
days_rest = replace(days_rest, days_rest > 30, NA)) |>
ungroup(team) |>
summarize("Average Days of Rest" = mean(days_rest, na.rm = TRUE))
# Plot the change in average days of rest between each game over time
# (days_rest is converted to numeric above so ggplot can pick a scale)
ggplot(data = schedule2) +
geom_col(aes(x = season, y = `Average Days of Rest`), fill = "salmon") +
labs(title = "Average Days of Rest Between Games: 2014-2015 to 2023-2024", x = "Seasons") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
# Examining the average number of consecutive away games in between each home game.
# Separate each team's schedule each season into groups.
# Add a column to assign a unique denotation to each stretch of away games. Do this
# by tracking the first game of each stretch of away games.
# Then count the number of games in each stretch of away games.
# For each season take the average number of away games in between each home game.
schedule3 <- schedule |>
group_by(season, team) |>
arrange(gamedate) |>
mutate(roadtrip = cumsum(home == 0 & (lag(home) == 1 | is.na(lag(home))))) |>
filter(home == 0) |>
group_by(season, team, roadtrip) |>
summarize(num_games = n(), .groups = "drop_last") |>
ungroup(team) |>
summarize(average_length = mean(num_games))
# Plot the change in average number of games per away stretch over time
ggplot(schedule3) +
geom_col(aes(season, average_length), fill = "lightblue") +
geom_smooth(aes(season, average_length), method = "lm", formula = y ~ x, color = "black", se = FALSE) +
theme_minimal() +
labs(title = "Average Number of Games per Away Stretch: 2014-2015 to 2023-2024",
x = "Seasons", y = "Average Number of Consecutive Away Games") +
theme(plot.title = element_text(hjust = 0.5))
ANSWER 6:
Looking at the average number of days of rest between each game, clear shifts in league schedule design are evident. 2014-2015 to 2016-2017 can be considered the baseline prior to any changes. From 2017-2018 to 2019-2020 there was an increase in the average number of days of rest between games. 2020-2021 was an anomaly due to the condensed post-COVID schedule, giving teams the fewest days of rest of any season. Lastly, from 2021-2022 to 2023-2024 the league reversed course slightly, reducing days of rest relative to 2017-2018 through 2019-2020 while still retaining more rest than in 2014-2015 to 2016-2017.
Additionally, when examining the average number of games in each away stretch between home games, another clear pattern emerges. From 2014-2015 to 2023-2024 there is a steady increase in the average number of games per away stretch, reflecting changes in schedule design. Notably, 2019-2020 saw the fewest average games per away stretch and 2020-2021 the most. These exceptions to the general trend can be attributed to the 2020 NBA bubble and the modified, condensed 2020-21 season.
QUESTION: Please design a plotting tool to help visualize a team’s schedule for a season. The plot should cover the whole season and should help the viewer contextualize and understand a team’s schedule, potentially highlighting periods of excessive travel, dense blocks of games, or other schedule anomalies. If you can, making the plots interactive (for example through the plotly package) is a bonus.
Please use this tool to plot OKC and DEN’s provided 80-game 2024-25 schedules.
ANSWER 7:
# Modify draft schedule to add latitude, longitude, number of days between
# the game and the prior game, and distance between the location and prior location.
# Create difficulty score primarily weighing rest while also accounting for travel distance
# Modify IST gap in December to estimate 3 days between the last IST game
# and the next game.
# Deal with NA value in first game for days of rest and distance columns.
library(geosphere)
schedule_2 <- draft_schedule |>
group_by(team, opponent) |>
#Add longitude and latitude
mutate(longitude = if_else(home == 0, locations$longitude[locations$team == opponent[1]],
locations$longitude[locations$team == team[1]]),
latitude = if_else(home == 0, locations$latitude[locations$team == opponent[1]],
locations$latitude[locations$team == team[1]])) |>
ungroup(opponent) |>
mutate(# Add days between games
days_between = gamedate - lead(gamedate),
days_between = replace(days_between, days_between > 8, 3),
days_between = replace(days_between, is.na(days_between), 8),
# Add distance between games
prev_longitude = lead(longitude),
prev_latitude = lead(latitude),
distance = distHaversine(pick(prev_longitude, prev_latitude), pick(longitude, latitude)) / 1609.344,
distance = replace(distance, is.na(distance), 0),
# Create a difficulty score primarily weighting rest while also
# accounting for travel
difficulty_score = as.numeric(8 - days_between + (distance / 2000))) |>
# Remove extraneous columns and order by date
select(-prev_longitude, -prev_latitude) |>
arrange(gamedate)
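To make the difficulty score concrete, here it is as a standalone function applied to two hypothetical games. The 8-day rest cap and the 2000-mile distance scaling mirror the pipeline above; the example inputs are invented:

```r
# Difficulty score: less rest raises it, more travel raises it.
difficulty <- function(days_between, distance_miles) {
  8 - days_between + distance_miles / 2000
}
difficulty(1, 1000)  # back-to-back plus a 1000-mile flight: 7.5
difficulty(3, 0)     # 3 days between games, no travel: 5
```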
# Create separate draft schedules for each team
okc_schedule2 <- schedule_2 |>
filter(team == "OKC")
den_schedule2 <- schedule_2 |>
filter(team == "DEN")
# This tool will create a travel map with the paths colored based on the amount
# of days since the last game, as well as a tracker of the difficulty score
# of the games (weighting rest the most and travel distance some as well).
# NOTE: The code is quite long since ggplot does not support coloring individual
# line segments according to a factored variable in geom_path(). So to see that
# visual I had to manually code it.
# Input the schedule as modified above as well as the locations data.
plotting_tool <- function(schedule, locations) {
library(maps)
library(plotly)
library(zoo)
# This function is required later in the program to isolate games with a
# certain number of days between the prior game and the game itself,
# as well as the games prior to each of them.
n_days_of_rest <- function(schedule, n) {
insert_row_every_3rd <- function(df, new_row_data) {
# Ensure new_row_data has the same column types
if (!is.data.frame(new_row_data)) {
new_row_data <- as.data.frame(t(new_row_data))
}
colnames(new_row_data) <- colnames(df)
# Initialize an empty data frame to store the result
result_df <- data.frame()
# Loop through the original data frame and insert rows
for (i in 1:nrow(df)) {
result_df <- rbind(result_df, df[i, ])
if (i %% 2 == 0) {
result_df <- rbind(result_df, new_row_data)
}
}
return(result_df)
}
# Merge the games of specific rest and the games immediately prior to each
# of them.
merge <- function(schedule, n) {
games <- filter(schedule, schedule$days_between == n) |>
ungroup()
games_prior <- filter(schedule, lead(schedule$days_between) == n) |>
ungroup()
for (i in seq_along(games_prior$season)) {
games <- add_row(games, slice(games_prior, i), .before = 2 * i - 1)
}
games
}
# Add NA's between each pair.
games <- insert_row_every_3rd(merge(schedule, n),
c(2024, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA))
# Group pairs via the NA's (assign the result so the pairs column is kept)
games <- games |>
mutate(pairs = cumsum(is.na(longitude)))
# Address the first game of the season in the 8 days between set.
if (n == 8) {
games <- games |>
filter(games$gamedate > as.Date("2024-10-28"))
}
games
}
# Plot background map for the travel graph.
map <- map_data("state")
plot <- ggplot() +
geom_polygon(data = map, aes(long, lat, group = group), fill = "lightgrey") +
theme_minimal() +
geom_path(data = schedule, aes(longitude, latitude))
# Add traces of different color over paths corresponding to each possible number
# of days between that game and the prior game
plot <- ggplotly(plot) |>
style(hoverinfo = "none", traces = 1) |>
add_trace(type = "scatter", data = n_days_of_rest(schedule, 8), x = ~longitude,
y = ~latitude, mode = "lines", split = ~pairs, color = "purple",
line = list(width = 16), name = "8 Days of Rest", showlegend = TRUE,
hoverinfo = "none") |>
add_trace(type = "scatter", data = n_days_of_rest(schedule, 5), x = ~longitude,
y = ~latitude, mode = "lines", split = ~pairs, color = "green",
line = list(width = 12), name = "5 Days of Rest", showlegend = TRUE,
hoverinfo = "none") |>
add_trace(type = "scatter", data = n_days_of_rest(schedule, 4), x = ~longitude,
y = ~latitude, mode = "lines", split = ~pairs, color = "blue",
line = list(width = 11), name = "4 Days of Rest", showlegend =
TRUE, hoverinfo = "none") |>
add_trace(type = "scatter", data = n_days_of_rest(schedule, 3), x = ~longitude,
y = ~latitude, mode = "lines", split = ~pairs, color = "yellow",
line = list(width = 10.5), name = "3 Days of Rest", showlegend = TRUE,
hoverinfo = "none") |>
add_trace(type = "scatter", data = n_days_of_rest(schedule, 2), x = ~longitude,
y = ~latitude, mode = "lines", split = ~pairs, color = "orange",
line = list(width = 6.5), name = "2 Days of Rest", showlegend = TRUE,
hoverinfo = "none") |>
add_trace(type = "scatter", data = n_days_of_rest(schedule, 1), x = ~longitude,
y = ~latitude, mode = "lines", split = ~pairs, color = "maroon",
line = list(width = 3), name = "1 Day of Rest", showlegend = TRUE,
hoverinfo = "none") |>
# Mark the cities of each team.
add_markers(type = "scatter", data = locations, x = ~longitude, y = ~latitude,
mode = "markers", hoverinfo = "text", text = ~team,
marker = list(color = "black", size = 10), name = "",
showlegend = FALSE) |>
# Add info underneath the legend.
layout(showlegend = TRUE, legend = list(title = list(text = "Key")),
annotations = list(
list(xref = "paper", yref = "paper", x = 1.49, y = -0.65, showarrow = FALSE,
text = paste("Number of Games:", length(schedule$season))),
list(xref = "paper", yref = "paper", x = 1.49, y = -0.75, showarrow = FALSE,
text = paste("Dates:", min(schedule$gamedate), "to", max(schedule$gamedate))),
list(xref = "paper", yref = "paper", x = 1.49, y = -0.85, showarrow = FALSE,
text = paste("Number of Days:", max(schedule$gamedate) - min(schedule$gamedate))),
list(xref = "paper", yref = "paper", x = 1.49, y = -0.95, showarrow = FALSE,
text = paste("Number of Flights:", sum(schedule$home == 0 | (schedule$home == 1 & (lag(schedule$home) != 1 | is.na(lag(schedule$home)))))))
),
xaxis = list(title = ""),
yaxis = list(title = ""))
# Add the difficulty score tracker.
# Use score for individual game, mean score, and 7 game rolling mean score.
plot2 <- ggplot(schedule) +
geom_line(aes(x = 1:80, y = difficulty_score), color = "blue") +
geom_line(aes(x = 1:80, y = mean(difficulty_score))) +
geom_line(aes(x = 1:80, y = rollmean(difficulty_score, k = 7, fill = NA, align = "right")), color = "red") +
labs(x = "Games", y = "Difficulty Score (Rest and Travel)")
# Modify legend and hover data information.
plot2 <- ggplotly(plot2) |>
style(name = "Difficulty Score", showlegend = TRUE, hoverinfo = "text", text = ~y, traces = 1) |>
style(name = "Mean Difficulty Score", showlegend = TRUE, hoverinfo = "text", text = ~y, traces = 2) |>
style(name = "Rolling Average (7 games)", showlegend = TRUE, hoverinfo = "text", text = ~y, traces = 3) |>
layout(showlegend = TRUE, title = list(text = paste(schedule$team, "Schedule Visualization")))
# Combine the two plots.
subplot(plot, plot2, nrows = 2, titleX = TRUE, titleY = TRUE)
}
plotting_tool(okc_schedule2, locations)
plotting_tool(den_schedule2, locations)
QUESTION: Using your tool, what is the best and worst part of OKC’s 2024-25 draft schedule? Please give your answer as a short brief to members of the front office and coaching staff to set expectations going into the season. You can include context from past schedules.
ANSWER 8:
For the purposes of this question, the difficulty of each unique 7-game stretch of the season was assessed. If a larger or smaller window were desired, that could easily be changed. The windows were identified with the rolling-average line on the plotting tool’s difficulty score graph.
To begin, looking at all of the 7-game stretches, the worst part of the season is games 54 to 60: from February 23, 2025 at MIN to March 5, 2025 at MEM. This is a stretch of 7 games in 11 days, with 2 back-to-backs and 5 games off of 1 day of rest. This level of density is repeated elsewhere in the schedule, but the tiebreaker here is that each game is at a different location from the last, with no consecutive home games, requiring more flights and longer travel times.
On the other hand, directly prior to the worst part, the best part of the season is games 47 to 53: from February 5 vs PHO to February 21 at UTA. This is a stretch of 7 games over 17 days, with 2 back-to-backs, 4 games off of 1 day of rest, and 1 game after the All-Star break. The best 7-game stretch of the season was inevitably going to include either the first game of the season or the first game after the All-Star break, since both were weighted equally (each assigned the maximum rest value). Notably, the games around both of those games were quite condensed. The first 7 games of the season did include 1 more day of rest than the stretch selected here; however, the first 7 games required 5 flights instead of 4, and those flights were far longer as well. The much-reduced travel from games 47 to 53 was therefore deemed to outweigh the 1-day rest advantage of the first 7 games of the season.
QUESTION: Please estimate how many more/fewer regular season wins each team has had due to schedule-related factors from 2019-20 through 2023-24. Your final answer should have one number for each team, representing the total number of wins (not per 82, and not a per-season average). You may consider the on-court strength of the scheduled opponents as well as the impact of travel/schedule density. Please include the teams and estimates for the most helped and most hurt in the answer key.
If you fit a model to help answer this question, please write a paragraph explaining your model, and include a simple model diagnostic (eg a printed summary of a regression, a variable importance plot, etc).
# Modify the schedule to include longitude, latitude, days between games, distance
# between games, whether the game is the second leg of a back-to-back.
# Deal with NA values.
enhanced_schedule <- schedule |>
group_by(season, team, opponent) |>
# Longitude and Latitude
mutate(longitude = if_else(home == 0, locations$longitude[locations$team == opponent[1]],
locations$longitude[locations$team == team[1]]),
latitude = if_else(home == 0, locations$latitude[locations$team == opponent[1]],
locations$latitude[locations$team == team[1]])) |>
ungroup(opponent) |>
mutate(# Days between Games
days_between = gamedate - lead(gamedate),
days_between = replace(days_between, is.na(days_between), 10),
# Distance between Games
prev_longitude = lead(longitude),
prev_latitude = lead(latitude),
distance = distHaversine(pick(prev_longitude, prev_latitude), pick(longitude, latitude)) / 1609.344,
distance = replace(distance, is.na(distance), 0),
is_b2b = gamedate - lead(gamedate) == 1,
is_b2b = replace(is_b2b, is.na(is_b2b), FALSE)) |>
# Remove extraneous columns and sort by date
select(-prev_longitude, -prev_latitude) |>
arrange(gamedate)
# Create temporary schedule to facilitate opponent net rating calculation
temp_schedule <- enhanced_schedule |>
group_by(season, team, gamedate) |>
# Add points and possessions as columns. For both offense and defense
mutate(points = game_data$points[game_data$gamedate == gamedate & game_data$off_team == team],
possessions = game_data$possessions[game_data$gamedate == gamedate &
game_data$off_team == team],
defpoints = game_data$points[game_data$gamedate == gamedate &
game_data$off_team == opponent],
defpossessions = game_data$possessions[game_data$gamedate == gamedate &
game_data$off_team == opponent]) |>
ungroup(gamedate) |>
#Compute season net rating updated every game
mutate(ortg = cumsum(points) / (cumsum(possessions) / 100),
drtg = cumsum(defpoints) / (cumsum(defpossessions) / 100),
nrtg = ortg - drtg) |>
ungroup()
# Add opponent's net rating for the season to each game
enhanced_schedule <- enhanced_schedule |>
left_join(
temp_schedule |>
select(gamedate, team = opponent, opp_nrtg = nrtg),
by = c("gamedate", "team")
)
# Examine correlations with correlation matrix
library(corrplot)
data_filtered <- enhanced_schedule |>
ungroup() |>
select(win, home, days_between, distance, is_b2b, opp_nrtg)
data_filtered <- data_filtered |>
mutate(across(everything(), as.numeric))
correlation_matrix <- cor(data_filtered)
corrplot(correlation_matrix, method = "color", type = "full",
addCoef.col = "black",
number.cex = 0.7)
# Remove days between the game and prior game since it was the weakest correlation
schedule_model <- glm(win ~ home + distance + is_b2b + opp_nrtg, data = enhanced_schedule, family = binomial)
# Filter to correct seasons
enhanced_schedule <- enhanced_schedule |>
filter(season >= 2019 & season <= 2023) |>
ungroup(season)
# Now calculate number of wins gained and lost for each team.
schedule_vars <- c("home", "distance", "is_b2b", "opp_nrtg")
# Find average values for each variable
baseline_vals <- enhanced_schedule |>
summarize(across(all_of(schedule_vars), ~ mean(.x, na.rm = TRUE)))
# Counterfactual dataset with neutral schedule
schedule_cf <- enhanced_schedule
for (i in schedule_vars) {
schedule_cf[[i]] <- baseline_vals[[i]][1]
}
schedule_cf$is_b2b <- FALSE
# Use the model on both datasets
pred_actual <- predict(schedule_model, newdata = enhanced_schedule, type = "response")
pred_cf <- predict(schedule_model, newdata = schedule_cf, type = "response")
# Subtract between the two win totals
results <- enhanced_schedule |>
ungroup() |>
mutate(pred_actual = pred_actual,
pred_cf = pred_cf,
diff = pred_actual - pred_cf) |>
group_by(team) |>
summarise(
expected_wins_actual = sum(pred_actual, na.rm = TRUE),
expected_wins_cf = sum(pred_cf, na.rm = TRUE),
schedule_effect = sum(diff, na.rm = TRUE)
) |>
ungroup() |>
mutate(
schedule_effect = schedule_effect - mean(schedule_effect, na.rm = TRUE)
) |>
arrange(desc(schedule_effect))
print(results)
## # A tibble: 30 × 4
## team expected_wins_actual expected_wins_cf schedule_effect
## <chr> <dbl> <dbl> <dbl>
## 1 MIL 201. 201. 5.95
## 2 LAC 200. 200. 5.36
## 3 BOS 199. 200. 4.42
## 4 UTA 198. 200. 3.61
## 5 PHX 199. 201. 3.54
## 6 DEN 198. 201. 2.99
## 7 DAL 199. 202. 2.79
## 8 MIA 198. 201. 2.62
## 9 BKN 197. 200. 1.91
## 10 TOR 196. 200. 1.63
## # ℹ 20 more rows
# Print top and bottom 3
cat("\nMost helped:\n")
##
## Most helped:
print(slice_max(results, schedule_effect, n = 3) |>
select(team, schedule_effect))
## # A tibble: 3 × 2
## team schedule_effect
## <chr> <dbl>
## 1 MIL 5.95
## 2 LAC 5.36
## 3 BOS 4.42
cat("\nMost hurt:\n")
##
## Most hurt:
print(slice_min(results, schedule_effect, n = 3) |>
select(team, schedule_effect))
## # A tibble: 3 × 2
## team schedule_effect
## <chr> <dbl>
## 1 DET -6.37
## 2 SAS -5.73
## 3 ORL -5.71
ANSWER 9:
Model Explanation (refer to the correlation map above): My model took into account whether a game was at home, whether it was the second night of a back-to-back, how far the team travelled to get to the game, and the opponent’s net rating, and used these factors to predict whether a team would win the game. Per the results above, the most helped team was MIL (+5.95 wins) and the most hurt was DET (-6.37 wins).
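The counterfactual approach used above can be summarized in miniature: fit a logistic model, predict each game twice (once with the real schedule features, once with neutral ones), and sum the per-game probability differences. This sketch runs on simulated games with invented coefficients, not the real schedule data:

```r
# Minimal sketch of the counterfactual win-adjustment, on simulated games.
set.seed(1)
n <- 1000
games <- data.frame(
  home = rbinom(n, 1, 0.5),
  is_b2b = rbinom(n, 1, 0.2),
  opp_nrtg = rnorm(n)
)
# Simulate outcomes where home court helps and back-to-backs /
# stronger opponents hurt (coefficients are made up for illustration)
p <- plogis(0.4 * games$home - 0.3 * games$is_b2b - 0.15 * games$opp_nrtg)
games$win <- rbinom(n, 1, p)
model <- glm(win ~ home + is_b2b + opp_nrtg, data = games, family = binomial)
# Neutral schedule: league-average values for every factor, no back-to-backs
neutral <- games
neutral$home <- mean(games$home)
neutral$is_b2b <- 0
neutral$opp_nrtg <- mean(games$opp_nrtg)
# Schedule effect = actual expected wins minus neutral expected wins
effect <- sum(predict(model, games, type = "response") -
              predict(model, neutral, type = "response"))
```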